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ABSTRACT 

We developed a dataflow framework which provides a basis for rigorously defining strate- 
gies to make use of runtime preprocessing methods for distributed memory multiprocessors. 

In many programs, several loops access the same off-processor memory locations. Our 
runtime support gives us a mechanism for tracking and reusing copies of off-processor data. 
A key aspect of our compiler analysis strategy is to determine when it is safe to reuse copies of 
off-processor data. Another crucial function of the compiler analysis is to identify situations 
which allow runtime preprocessing overheads to be amortized. This dataflow analysis will 
make it possible to effectively use the results of interprocedural analysis in our efforts to 
reduce interprocessor communication and the need for runtime preprocessing. 


'Research was supported by the National Aeronautics and Space Administration under NASA Contract 
No. NAS1-18605 while the authors were in residence at the Institute for Computer Applications in Science 
and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23665. 
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1 Introduction 

We present a dataflow framework that can be employed to systematically make use of runtime 
preprocessing methods aimed at loops in which some array references are made through a 
level of indirection. The dataflow framework we present here pertains to collections of 
loops with no loop-carried data dependences or with only accumulation type dependencies. 
Such loops are often referred to as data-parallel loops, and are the primary target of the 
Fortran D compiler [4]. The type of irregular loops we are trying to handle are typically 
found in unstructured mesh explicit and multigrid solvers, molecular dynamics codes, and 
some sparse iterative linear systems solvers. 

In distributed memory machines, large data arrays need to be partitioned between lo- 
cal memories of processors. These partitioned data arrays are called distributed arrays. 
Long term storage of distributed array data is assigned to specific memory locations in the 
distributed machine. In many irregular problems, we can reduce the amount of data to be 
communicated by using a partitioning algorithm that individually assigns each array element 
to a specific processor. Furthermore, these machines often have a non-trivial communications 
latency or startup cost. Therefore, efficiency demands that information to be transmitted 
should be collected into relatively large messages; this in turn implies that the elements to 
be sent and received by each processor should be precomputed. In irregular problems, the 
communications pattern depends on the input data, typically because of some indirection 
in the code. In this case, it is not possible to predict at compile time what data must be 
prefetched. 

We treat this lack of information by transforming the original parallel loop into two 
constructs called an inspector and executor [6, 7]. During program execution, the inspector 
examines the data references made by a processor, and calculates what off-processor data 
needs to be fetched and where that data will be stored once it is received. The executor loop 
then uses the information from the inspector to implement the actual computation. The 
Fortran D compiler now under development at Rice University performs these tasks by calls 
to the PARTI library built at ICASE. 

The PARTI primitives (Parallel Automated Runtime Toolkit at ICASE) were designed 
to ease the implementation of irregular computational problems on parallel architecture 
machines by relieving the user or compiler writer of having to deal with many low-level 
issues. These procedures 

1. Coordinate interprocessor data movement, 

2. Manage the storage of and access to copies of off-processor data, and 

3. Support a shared name space by building a distributed translation table [9] to store 
the local address and processor number for each distributed array element. 

This functionality can be used directly to generate inspector/executor pairs. Each inspector 
produces a communications schedule , which is essentially a pattern of communication for 
gathering or scattering data. Hash tables are used to avoid repeatedly communicating the 
same array elements. The executor has embedded PARTI primitives to gather or scatter data. 
The primitives are designed so that the final parallel code remains as close as possible to the 
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original sequential code. The primitives issue instructions to gather, scatter or accumulate 
(e.g. scatter followed by add) data according to a specified schedule. The latency or start-up 
cost is reduced by packing several small messages with the same destinations into one large 
message. Significant work has gone into optimizing the gather, scatter and accumulation 
communication routines for the iPSC/86Q. It is not the purpose of this paper to describe 
the design and implementation of PARTI in great detail; information on this can be found 
elsewhere [1], 

Our runtime support makes it possible to track and reuse off-processor data copies [2]. 
We generate incremental communications schedules to obtain only those off-processor data 
not requested by a given set of pre-existing schedules. This gives us the runtime support we 
need to combine and hoist gather, scatter and accumulate procedures. Removal of duplicates 
is achieved by using a hash table. In a mesh solver, for example, the off-processor data to 
be accessed by the edge schedule is first hashed using a simple hash function. Next all the 
data to be accessed during the faceJoop is hashed. At this point the information that exists 
in the hash table allows us to remove all the duplicates and form the incremental schedule. 

The data flow' framework to be described here aims at providing good information about 
what data are needed at which points in the code, along with information about what live 
off-processor data are available. At compile time, we compute global flow information about 
the communication characteristics of the loops around a flow graph. This framework bears 
similarities to classical techniques such as common subexpression elimination, loop invariant 
code motion, and dead code elimination. The framework provides a basis for determining at 
compile time 

• Where communication schedules are to be generated, 

• Where gather, scatter, and accumulate operations are to be placed, and 

• When incremental schedules may be employed. 

In our data flow analysis, some of the variables reflect inherent properties of the analyzed 
program, while others calculate the results of heuristics we employ in order to producing the 
gather and scatter operations. Our heuristics aim to 

• Exploit situations where we can reuse communications schedules, and to 

• Remove duplicate communications by combining and hoisting gather, scatter and ac- 
cumulate procedures. 

The rest of the paper is organized as follows. Section 2 provides some definitions and 
terminology for the framework. Section 3 introduces the local flow variables, followed by 
global variables in Section 4 and result variables in Section 5. Section 6 gives an extension 
of the framework for handling reduction operations. In Section 7, we work through the data 
flow variables for a program example, which is a simplified version of a mesh solver for which 
Section 8 gives concrete experimental results illustrating the effect of exploiting data flow 
information. Section 9 contains concluding remarks. 
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2 Basics of the Framework 


This section describes the scope of the framework developed in this paper and defines some 
concepts used in later sections. 

2.1 The domain 

Even though our implementation can handle other cases as well, we assume here for presenta- 
tion purposes that all indirect references in the program text are of the form (arr ay) ((index -array) ((looj. 
For many programs, this can actually be achieved by forward substituting array indices. For 
example, the code sequence j = ia(i) ; x(j)-10 would be treated as x(ia(i)) = 10. Arrays 
which are never referenced indirectly are assumed to be analyzed using other methods [3] 
prior to this analysis. References with multiple (but bounded) levels of indirection will re- 
quire more levels of complexity in the dataflow framework; we do not consider potentially 
unbounded indirection, as is found in linked lists. 

Let V be the set of arrays which are accessed indirectly. We assume that each reference r 
to some v £ V is contained in some loop(s). Let L be the set of loops which directly enclose 
an occurrence of some v £ V. We assume that no f G L encloses any other m £ L . One set 
of data flow variables is computed for each element of a set of nodes , N . It is N = L U P, 
where P contains one entry pad , Entry, and one exit pad, l ex n , for each loop / ^ L containing 
some /' £ L. Furthermore, we assume l en try ( Exit ) to be executed before (after) l iff / has at 
least one iteration. The framework operates on a loop flow graph G = (N, E) of the program, 
where the edges E are simple control flow edges. 

For example, if l is an outer time stepping loop which does not directly contain any 
irregular array references but which contains a loop V over mesh edges, then l' is represented 
as a node in G and / is represented as some interval in G . In the following, loop refers to 
elements of Af, z.e., it may denote a pad as well. 

Future work will present a complete framework in which summary information is built in 
a bottom-up fashion similar to array kill information [3]. Finally, this paper only discusses 
the case where the summarized loops have no data dependences, except for commutative 
and associative reductions which are handled specially. 


2.2 Array portions 

Array portions are a central concept to the framework and best introduced by an example. 
A portion x(ia(l:n)) consists of the array x and the index set ia(l:n). This index set in 
turn consists of the index array ia and the range (l:n), which has the lower bound 1 and 
the upper bound n. 

Several portions may be taken from the same array or may have the same index set. 
The index range does not have to be known at compile time, so the bounds may contain 
symbolics. No assumptions are made about whether different portions taken from the same 
array are disjoint or whether they overlap each other partially or completely. This allows 
analyzing symbolic index ranges, but it requires the analysis to be conservative when using 
intersection and set subtraction in the equations. 


3 


The framework can be implemented using bit vectors, each bit representing one array 
portion. The length of these bit vectors is bounded by the number of indirect array references 
(i.e., it is linear in program size), and all equations given here are rapid [5]. Therefore, 
using bit vectors for the analysis gives us good asymptotic running times. However, for 
our examples (and probably also in a practical implementation), it seems advantageous to 
represent the different flow variables as bit matrices. The rows of a bit matrix correspond to 
the arrays of the portions represented ( e.g . x in x(ia(l :n))), while the columns correspond 
to the index sets (ia(l :n)). Theoretically that representation increases variable sizes from 
linear in program size to quadratic in program size, so the feasibility of this approach depends 
on how programs behave in practice. However, this representation makes potential schedule 
sharing, for example, very easy to recognize by determining which index set columns have 
more than one entry. 

We assume that all indirect array references are identified in a previous pass over the 
program text and construct bit vectors/matrices accordingly. For the analysis we also assume 
that a (identity) dummy index array is inserted for all direct array references. 


2.3 Operations on portions 

To aid the distinction between portions, indirect array references, array elements, and sets 
of all these constructs, we make a short digression to introduce the conversion operators 
elements-of p (where p is some portion or set of portions), denoted p, and references-of p, 
denoted p. Assume we are given 

• an array x\ 

• an index array ia(l : 5); 

• portions p = x(ia(pi : p u )), q = x(ia(qi : q u )), r = x(ia(ri : r u )); and 

• sets of portions A = {p, qj, B — {p}, C = {r}. 

We can reason about A, B , and C at different levels. For example, if the index ranges of the 
portions are only known symbolically, one can determine at the portion level that 

AD B 

must hold, but no other relationships can be proven among the sets of portions. However, if 
we know for example that 

pi — 1 ^ Pu — 3, qi — 3, qu — 5, i'i = 3, r u = 4, 

then the elements-of operator, ~, can be applied to the portions and to the sets thereof, to 
obtain 

A = a:(fa(l : 5)), B = x(ia( 1 : 3)), C = x(fa(3 : 4)). 

The scope of ~ is extended to set operators and predicates, so we can assert at the element 
level that 

ADB, 

ADC. 
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Assume furthermore that we know the values of the index array to be 


ia( 1 : 5) = 1,4,3, 1,4. 

Then the references-of operator, ”, obtains 

A = *(1,3,4), b = *(1,3,4), C = *(1,3). 

With this knowledge, we conclude at the reference level that 

ADBbc. 

We can see how the set relationship predicates change over the different levels of reasoning, 
with 

X D Y =>• Xt bY => X DY. 

Another interesting operation in this context is set subtraction: 

• A\B = {p,q]\{p} = {<?}, which is *(1,3,4); 

• A\B — A\B = *(ia(l : 5)) \ x{ia{ 1 : 3)) = x(ia( 4,5)), which is *(1,4); 

• A\B = Z A \ £ = *(1,3,4) \ *(1,3,4) = 0. 

As described in Section 5, A \ B (and the corresponding sets at lower levels) can be viewed 
as a so called incremental schedule, which indicates what has to be communicated if A is 
needed and B is already available in local memory. We can see immediately the consequences 
for this incremental schedule in the example: the more we know about portions, the less we 
might have to communicate. Formally, 

X\Y D X\Y 2 X\Y. 

To aid formulating conservative equations which still offer the possibility to exploit any 
knowledge potentially available at compile time, we introduce some set operators which map 
sets of portions into sets of portions. Given some set of portions SET, we define 


SET 

— 

{p | p has same array as some q 6 SET}, 

SET 

— 

{p | SET might affect p) 


— 

{p | p f\SET / 0 cannot be disproven} 


c 

SET, 

SET 

= 

{p | SET contains p} 


= 

{p | pCSET can be proven} 



SET, 

SET 

= 

{p | SET might partially touch part of p] 


= 

SET \ SET. 
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SET can be derived easily from SET by just reducing a bit matrix (array names by 
index sets) to a bit column (array names) using row-wise OR. From there we can conserva- 
tively approximate SET u , SET n , and SET directly, or we can employ further compile time 
knowledge about how portions relate to each other if available. Either way, we do not leave 
the portion space as given in the program, i.e., we can still represent these sets with binary 
bit matrices. 

For example, let the portions p , q, r be defined as above, and let D = {<7}. Assuming no 
compile time knowledge at the element or reference level, we can conservatively assume that 
EE = {p, <7,r}, EE = {p, <7, r}, TP = {<7}, and EE — {p, r}. With knowledge at the element 
level, we have EE = {p, <7, r}, EP = {p,g,r}, EP = {q,r}, and EE = {p}. Reference level 
knowledge gives EE — { p, <7, r }, EP = {p, <7,r}, EP = {p, <7,r}, and D° — 0 . 

A point to keep in mind when reasoning about which elements are contained in which 
portions and how portions relate to each other is that two portions p, q might globally 
contain the same set of array elements of some array X, but that locally a given processor 
sees different parts of X for p and q. (This applies to lhs occurrences as well, since we apply 
the owner computes rule based on index array ownerships, not on data array ownerships; 
otherwise we would not need a SCATTER operation). In this case there has communication 
to occur if for example we first define p and then use q. The important consequence is that 
we must apply ~ and ~ based on the share of each processor. 

Furthermore, we have to keep different distributions of arrays and index arrays in mind 
for the analysis. For example, we cannot reuse a schedule between two portions which have 
the same index set, but whose arrays are distributed differently. For sake of simplicity, 
however, we assume in this paper that all arrays are conformable. 

3 The Local Flow Variables 

We define the local flow variables to be the components of the data flow equations which are 
determined by local analysis of each loop. In the following, 

• l stands for an arbitrary loop, 

• p denotes a portion x(ia(lb rub) ), 

• an occurrence of p is either a use of p or a definition of p, and 

• variable or flow variable stand for data flow variables. 

We begin with two variables, REF and DEF, which are familiar from standard live 
variable analysis. A point to keep in mind, however, is that here live does not refer to 
whole arrays, but to limited portions thereof instead. Also, there may be conditionals in the 
loops generating the variables, which can be handled by annotating portions with (symbolic) 
guards applying to whole portions or elements thereof. 

For each loop /, we define 

REF(l): the portions live on entry to l, and 
DEF(l): the portions defined in l. 


6 


Formally: 


REF(l) = {p | first stmt containing p in l reads p}, 

DEFfil) = {p | some stmt in l assigns to p}. 

To aid the extension to reduction statements discussed in Section 6, we do not base the 
further development of the framework on REF and DEF directly, but replace them with 
GET and PUT. These variables are used to derive the portions which have to be buffered 
locally. We define 

GET(l): the portions referenced in 1 from local memory (the buffer ), 

PUT(l): the portions written by 1 into the buffer, and 
BUF(l): the portions which will be buffered on exit from l. 

The equations (which will be redefined in Section 6): 

GET(l) = REF{1), 

PUT(l) = DEF(l), 

BUF(l) = GET(l) U PUT(l). 

We also have to compute the live ranges of index sets, otherwise we might accidentally 
try to communicate a portion before or after the program region where the index set of that 
portion is available (be., before the index set is defined or after it is overwritten with other 
values). We define 

IND(l): the portions whose index sets may be computed (in part) by l. 

KILL(l): the portions that may be made invalid by l, either because l assigns an overlapping 
part of the array or l reassigns the index set. GATHER operations can never be hoisted 
above l for these portions. 

FLUSH(l): the portions that may be read by l or whose index sets may be reassigned by 
/. SCATTER operations can never be delayed until after 1 for these portions. 


Formally: 


IND(l) = {p | P has index set ia( i-min • ijnax') 
and 1 assigns to ia), 

KILL(l) = IND(l) U DEF°(l), 

FLUSH (l) = IND(l) U REF°(l). 


4 The Global Flow Variables 

The computation of the global flow variables constitutes the meat of the data flow frame- 
work. Here we actually propagate knowledge about the communication characteristics of 
the loops around in the flow graph. The problems addressed here have elements from Com- 
mon Subexpression Elimination, Loop Invariant Code Motion, and Dead Code Elimination. 
As already mentioned, all equations given here are rapid, so we can expect to solve them 
efficiently using simple iterative techniques. All global variables are initialized to 0. 
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4.1 Fetches 


The strategy for determining where to place GATHER operations is based on the following 
definitions: 

LIVE any / a ^(l): the portions which are needed in l or in all/any of the following loops. 

BUFFD(l): the portions which are already available when entering l. Here we assume that 
buffers are not flushed unless the data in them may be invalid, because either the data 
array or the index array has been assigned to; 

HOIST(l): the portions for which a GATHER should be hoisted ahead of /. 

FETCH(l): the portions which are needed in /, or which are needed in some later loop and 
can be hoisted before l. 

The equations: 

LIVE M (l) = GET {l ) U fj ( LIVE all (s ) \ KILL(l)), 

sEsuccs(l) 

LIVE any (l ) = GET(l) U U ( LIVE any (s)\KILL(l )), 

s£succs(l) 

BUFFD(l) = BUF(l) U f) ( BUFFD(p ) \ KILL(l)), 

p€preds(l) 

HOIST (/) = p| (LIVE atl (p) U BUFFD(p)), 

p€preds(l) 

FETCH{1) = GET(l) U 

P (HOIST (s) D FETCH(s)). 

$E$uccs(l) 

At this point, we have identified candidate locations in the program for placing GATHER’s. 
In short, whenever a portion appears in a FETCH(\) set, then that portion can be GATHER’ed 
before l and will be used before it is assigned. The final placement will be determined by 
the result flow variables discussed in Section 5. 

Note that we can not only distinguish the variables defined so far by whether they are 
local or global, but we can also classify them into either reflecting fixed properties inherent 
of the analyzed program, or being subject to heuristics. Furthermore, this classification can 
be done based either on the definition of the variable, i.e., how it is defined in terms of 
other variables, or on the actual values of the variable. For example, HOIST is currently 
defined so that we combine and hoist up GATHER’s as much as possible, subject to the 
constraint that we never want to overcommunicate (even if that might be advantageous in 
some cases, for example in saving schedules). If we, for example, replace the LIVE alt in 
the definition of HOIST with LIVE any , we could hoist up communication even further, at 
the expense of possibly communicating unnecessary data. In other words, the definition of 
HOIST is a matter of heuristics , which is not the case for the other definitions so far. For 
other variables dependent on HOIST (so far, FETCH is the only such variable), their values 
become a matter of the chosen heuristics as well, but not their definition. 
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4.2 Stores 

The high level strategy for determining where to place SCATTER operations is relatively 
similar to the one for placing GATHER’s. Note that we do not have to scatter portions 
(i.e., send them back to the owner) if they are used only locally, which is why we restrict 
our attention to GET 0 instead of GET. The definitions: 

HIN any / all (l) /HOUT any / all (l): the portions touched by a reference on any/ all of the 
paths starting at the entry/exit of l. 

DELAY(l): the portions which should be scattered in a later loop, or which are dead on 
exit. 

STORE(l): portions which are assigned to in l, or which were assigned to earlier and whose 
SCATTER’s can be hoisted into l. 


HIN °"(0 
HOUT all (l) 

HlN any ( l) 
HOUT any (l) 

DELAY (l) 


= GET°(l) U HOUT all (l), 

= n nuv aii (s ), 

s£succs(l) 

= GET°(l) U HOUT any (s), 

= IJ HIN any (l ), 

s&succs(l) 

= n ( HOUT al \s ) U HOW a ny (s)) \ 

$€:succs(l) 


U FLUSH (s), 

sEsuccs(/) 

STORE(l) = PUT(l) U 

p| (DELAY (p)D STORE (p)). 

ptpreds(l) 


Our heuristic, here defined by DEL A Y, is to combine and delay SCATTER’s as much as 
possible, subject to the constraint that we never scatter data which are dead. 


5 The Result Flow Variables 

The result flow variables given in this section are computed after solving the equations given 
so far. They should accurately describe which portions have to be gathered before entering 
l or scattered after leaving l (possibly using reductions). Here we want to take previous and 
succeeding loops and their communication requirements into account as well. 


5.1 Fetches 

Similarly to FETCH, GATH( 1) describes which portions have to be in local memory before 
entering l. However, it excludes portions which must already be locally available either 
by previous gathers or by previous calculations. Furthermore, we may not only exclude 
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these available data on a portion by portion basis, but also on an element by element basis. 
In other words, if we know that a portion x(ia(i mtn :i roax )) is buffered, then we might 
not only eliminate gathers of exactly that portion, but we can also save on a gather of a 
potentially overlapping portion x(ia(j mtn : j max )) by gathering only the increment from the 
first portion to the second one. 

For that purpose we compute incremental schedules using the \ operator as introduced 

in Section 2; recall that A \B contains exactly those references which appear in the portions 
in A but do not appear in any of the portions in B. Note that this operator, unlike the 
\, U, D used in the flow equations so far, brings us out of the fixed space of sets of portions 
appearing in the program text, and applying it repeatedly can lead to an explosion of the 
number portions we have to be able to represent (nestings of increments of intersections of 
increments, etc.). Applying this operator just once, however, leads to sets which can still 
be represented by 3-valued “bit” vectors/matrices; in addition to inc.luded/not included , we 
also need explicitly excluded. 

Note also that A\B = 0 is possible even for A \ B ^ 0. This reflects for example the 
case where we express a mesh and its boundary as different portions of the same array; the 
portions are distinct, but one contains a subset of the other. 

The equation: 

GATH(l) = FETCH (l) \ 

D (FETCH (p) U BUFFD(p )). 

p£preds(l ) 


5*2 Stores 

The SC ATT variables are derived from the STORE variables, except that we eliminate 
unnecessary scatters by excluding portions which either will be scattered later, or which are 

not at least potentially live (using HOUT any ). Again, we use the set operator \ to support 
incremental schedules. 

SC ATT [l) = STORE (l) \ 

p| (STORE(s) U HOUT any (s )). 

$€succ$(l) 

Note that we can still override the communication patterns obtained by global analysis 
for GATH and SCATT by just substituting the local counterparts GET and PUT for them. 
Furthermore, this can be done for either both variables or just one of them, since they do 
not rely on each other, but merely on the loop properties. 

5*3 Schedules 

The framework described so far gives an accurate description of which schedules are needed 
where. Critical for the overall cost associated with our communications is also the generation 
of these schedules, in particular where the schedules are generated. However, once we know 
the communication requirements, schedule computation placement appears to be relatively 
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straightforward. Therefore, we currently use the simple heuristic of generating schedules as 
soon as possible, i.e ., as soon as the necessary index arrays are available. This seems to work 
well in the codes we have considered so far. 

6 Reduction Variables 

As indicated earlier, the framework developed so far can be extended to take advantage of 
reduction statements as well. The portions exclusively appearing in reduction statements 
can be treated differently from other defs and uses, since they are not necessarily brought 
into local memory if we use reduction operations like SCATTERADD or SCATTERMULT. 
However, portions appearing in different reduction operations within one loop have to be 
brought into local memory, so we have to carefully separate the portions into the ones used 
exclusively in ADD reductions and the ones used only in MULT reductions: 

ADD(l) = {p | all q 6 p u are only added to in /}, 

MULT (l ) = {p | all q € p u are only multiplied to in l}. 

We derive RED, the set of all portions which are used exclusively in reduction operations, 
and redefine GET and PUT which were introduced in Section 3: 

RED(l) = ADD(l) U MULT(l), 

GET(l) = REF(l) \ RED{1), 

PUT(l) = DEF(T) \ RED(l). 

The changes so far have eliminated the GATHER’s and SCATTER’s for portions which 
appear exclusively in reductions. 

We now define another, separate framework, which computes only the SCATTER-ADD’s 
(similarly for the other reductions). This ADD framework coexists with the non-reduction 
framework which is still used to compute communication requirements for non-reduction 
operations. The redefined variables are: 

GET ADD (l) = REF(l) \ ADD(l), 

FLUSH add{1) = IND(l) U GET 0 add{1), 

STORE ADD (l) = ADD(l)U f) 

pEpreds(l) 

(DELAY ADD { P ) n STORE ADD (p)). 

Corresponding to these new variables, we can derive HOUT a A pjf U , DELAY add-, 

and SCATT A dd with the same equations as for the non-reduction framework. SCATT A dd 
now indicates where to place SCATTER_ADD’s. 

Like for the non-reduction framework, we can override the result variable selectively 
with their local counterpart, which is here ADD. Note that the flow equations for ADD are 
defined independently of other reductions. This simplifies extending the framework to other 
reduction operations by just adding flow variables and equations, without having to modify 
existing ones (except extending RED). 
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GATHER(z(nfl ( 1 :nf )) ,z(nf2( 1 :nf ))) 
do iii = 1 , itime 

GATHER(y(iel ( 1 :ne)),y(ie2( 1 :ne)), 
y(nfl(l:nf)),y(nf2(l:nf))) 
do i = 1, ne 

x (ie 1 (i )) = x(iel(i)) + y(ie2(i)) 
x(ie2(i)) = x(ie2(i)) + y(iel(i)) 

enddo 

do j = 1 , nf 

x(nfl(j)) = x(nfl(j)) + y(nf2(j)) + z(nf2(j)) 
x(nf2(j)) = x(nf2(j)) + y(nfl(j)) + z(nfl(j)) 

enddo 
do k = 1, ne 

x(iel(k)) = x(iel(k)) + y(ie2(k)) 
x(ie2(k)) = x(ie2(k)) + y(iel(k)) 

enddo 

SCATTERADD(x(iel( 1 :ne)),x(ie2( 1 :ne')), 
x(nfl(l:nf)),x(nf2(l:nf))) 
do 1 = 1, nn 

y(0 = x(i) 

enddo 

enddo 

Figure 1 ; Example code, communication is already inserted as derived by the framework. 

7 Example 

Figure 1 shows an example code. In this program, we have 

• four inner loops, l\, / 2 , I3, and Z 4 ; 

• one entry and one exit pad, Iq and Z 5 ; 

• three array names, x, y , and z] 

• five index sets, S] = eel (1 : ne), s 2 = ie 2(1 : ne), S3 = if 1(1 : n/), s 4 = if 2(1 : n/), 
and S5 = identity( 1 : nn); 

• this spans a bit matrix of fifteen portions, x\ = x(si), ar 2 = ^(^2)1 • ■ • 1 ~5 = ^(■Ss)? twelve 
of which actually occur in the program text. 

The corresponding flow graph is shown in Figure 2. 

The bit matrices of the resulting local flow variables are shown in Figure 3 . A matrix 
entry for a particular portion p and a flow variable VAR is defined as follows: 

“1” - p is included in VAR, 

- p is not included in VAR, 
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Figure 2: Flow graph for example code. 


“0” - p is explicitly excluded from VAR (as a result of the \ operator; in our example, there 
are none such entries due to the simple control flow structure). 

Figure 4 shows the global and result variables. Figure 5 shows the variables for the ADD 
framework. The result variables, i.e.GATH and SCATT , determine where the GATHER 
and SCATTER operations should be placed. If the bit representing a portion is set in the 
GATH set, then a GATHER operation for that portion is placed at the beginning of that 
loop. Similarly, a set bit in the SCATT set results in placement of a SCATTER operation 
(SCATTER_ADD in the ADD framework) at the end of a loop. GATHER’s and SCATTER’s 
of portions with identity as the index array are ignored. This is valid because they represent 
data movement from a processor to itself. 

We do not show here the optimizations needed to generate the schedule operations (t.e., 
the inspectors). In general, the method is to identify the index sets used, and insert the 
inspectors at the birthpoints of those sets. The first step can be done by inspection, while 
the second is a simple application of reaching definition analysis. 


8 Experimental Results 

We summarize the results of some of the experiments we have carried out to evaluate the per- 
formance impact of our optimizations. The experiments employed an explicit unstructured 
mesh solver of the three dimensional Euler equations which comprise a non-linear system of 
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Figure 3: Local flow variables for example code. 

five differential equations. The calculation consists of a sequence of loops over edges, bound- 
ary faces and nodes of an unstructured mesh. The code was originally developed by Dimitri 
Mavriplis. The program was ported to the Touchstone Delta using Parti primitives [1, 2] 
and the code was run to simulate a variety of aircraft configurations under a range of test 
conditions. While the port was carried out by hand, the strategy used to place the PARTI 
primitives was the same as the strategies that would result from our dataflow framework. 
We executed a number of different versions of the parallel Euler solver to give an idea of how 
the code generated using the dataflow framework would affect performance. 

The reader should note that the simple test loops presented in Section 7 were for ease 
of exposition, our experimental work involved a full unstructured explicit Euler solver. The 
example code shown in Figure 1 depicts loops which are motivated by the actual Euler solver. 
The loop structure of the example code can be derived from the actual solver by inlining the 
function calls. The main loop in the exampleis analogous to a time-stepping loop. Arrays 
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Figure 4: Global flow variables and result flow variables for example code. 
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Figure 5: Flow variables for ADD framework of example code. 

in the actual Euler solver required GATHER’ing and SCATTER’ing at different levels, as 
illustrated by the y and z arrays in the example. Unlike our example, the arrays in the actual 
code are multidimensional; incorporating this into our framework only requires treating the 
additional dimension as a regularly-accessed array. (This dimension has a size of five, and 
all elements in that dimension are set together.) There are three different kinds of loops 
present: two over the edges (/i and / 3), one over the boundary faces ( I 2 ) and one over the 
nodes (/ 4 ). These are typical of the variety of loop types found in the actual code. In 
summary, Figure 1 abstracts many of the complex parts of the actual code (such as control 
flow and irregular computations), while ignoring the straightforward parts (such as short 
vectors in other dimensions). 

The test case we report here involves the computation of a highly resolved flow over a 
three-dimensional aircraft configuration. The mesh contained a total of 804,056 points and 
approximately 4.5 million tetrahedra. We believe this to be the largest unstructured grid 
Euler solution attempted to date. In Figure 6, we depict one of the meshes used in our 
experimentation (we do not show the 804K mesh due to printing and resolution limitations). 
For this case, the freestream Mach number is 0.768 and the incidence is 1.16 degrees. We 
employed the recursive spectral partitioning algorithm to carry out partitioning [8, 10]. 
Partitioning was performed on a sequential machine as a preprocessing operation. 

We present timings that result from four different implementations of the Euler solver. In 
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Figure 6: Coarse Unstructured Mesh about an Aircraft Configuration with Single Nacelle; 
Number of Points = 106,064, Number of Tetrahedra = 575,986. 

two cases, we employ incremental scheduling and aggressively combine and hoist gather and 
accumulate procedure calls. In the two other cases, we do not make use of the knowledge we 
would obtain through global dataflow analysis in placement of gather, scatter and accumulate 
procedure calls. In these versions we place gather procedures immediately before loops that 
contain irregular array references. The use of incremental scheduling leads to a large amount 
of live data reuse. For instance, one step of Runge Kutta integration in the experimental 
code uses the flow variables in a sequence of three loops over edges followed by a loop over 
boundary faces. The flow variables are only updated at the end of the entire integration, 
rather than after each loop. We can obtain all of the off-processor flow variables needed at 
the beginning of the step. 

The indirection arrays (in Figure 1 the arrays iel, ie2, ifl, if2), with which we 
form our schedules, usually do not get written into in non-adaptive calculations. In such 
situations one can form the communication schedules before any actual computation begins, 
z.e., in the preprocessing stage. In our experimental code, the edges and the boundary faces 
of the mesh are fixed throughout the computation. Hence, we can very easily move the 
formation of the schedule outside the main iteration loop, as the indirection arrays are live 
throughout loop body. The only way the generation of the schedules (i.e., the inspectors) 
can be moved outside a block of code is if it can be ascertained that the indirection array, 
with which the schedule is being formed, is live inside the whole block. This can only be 
determined through global dataflow analysis. 

We present timings that result from generating schedules as soon as the necessary in- 
dex arrays are generated and timings that result from generating a new schedule for each 
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Scheduling Method 

Time per 
Iteration 
(seconds) 

Perfor- 

mance 

(Mflops) 

Preprocessing 

Time 

(seconds) 

Not incremental, 
inside of main loop 

6.91 

573 

273 

Not incremental, 
outside of main loop 

4.18 

947 

2.73 

Incremental, 
inside of main loop 

5.64 

702 

299 

Incremental, 
outside of main loop 

2.65 

1496 

2.99 


Table 1: Explicit Unstructured Euler Solver on 804K Mesh on 512 Delta Processors 

invocation of each gather, scatter or accumulate procedure call. 

Table 1 depicts: 

• The time required per iteration, 

• The computational rate in Mflops, and 

• The preprocessing time needed per iteration for generating all communication sched- 
ules. 

Comparing the times for scheduling inside the loop and scheduling only once in the compu- 
tation, we see performance improvements ranging from 65% to over 100% The preprocessing 
time increases only modestly when we use incremental scheduling and is roughly equal to 
the cost of a single parallelized iteration. Once we have hoisted schedule generation outside 
the main iteration loop, use of incremental scheduling leads to an additional 58% reduction 
in total time. 


9 Conclusions 

Communicating the right data at the right time and place is a difficult, yet crucial task 
for parallelizing irregular problems. The PARTI primitives are valuable tools for the first 
part of the problem, namely for determining where to find which data and for efficient 
data exchange. The dataflow framework presented in this paper is designed for attacking 
the second part of the problem, namely enabling the compiler to make good use of these 
primitives without further advice by the user. We believe our approach to be effective for a 
wide range of interesting problems, as illustrated for an explicit unstructured mesh solver. 
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